19 research outputs found

    The OpenDC Microservice Simulator: Design, Implementation, and Experimentation

    Microservices is an architectural style that structures an application as a collection of loosely coupled services, making it easy for developers to build and scale their applications. The microservices architecture approach differs from the traditional monolithic style of treating software development as a single entity. Microservice architecture is becoming increasingly adopted. However, microservice systems can be complex due to dependencies between the microservices, resulting in unpredictable performance at large scale. Simulation is a cheap and fast way to investigate the performance of microservices in more detail. This study aims to build a microservices simulator for evaluating and comparing microservices-based applications. A microservices reference architecture is designed and used as the basis for the simulator. The simulator implementation uses statistical models to generate the workload. Compelling features added to the simulator include concurrent execution of microservices, configurable request depth, three load-balancing policies, and four request execution order policies. This paper presents two experiments that demonstrate the simulator's usage. The first experiment covers request execution order policies at the microservice instance. The second experiment compares load-balancing policies across microservice instances. Comment: Bachelor's thesis.
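    To make the request execution order policies concrete, the sketch below simulates a single microservice instance that serves a trace of (arrival time, execution time) requests under two illustrative policies, FIFO and shortest-job-first. This is a minimal Python sketch under an assumed trace format and assumed policy names, not OpenDC code or the simulator's actual policies.

```python
import heapq

def simulate_instance(requests, policy="fifo"):
    """Single-instance queue simulation; requests are (arrival_time, exec_time) tuples.

    Returns the mean sojourn time (waiting + execution) under the chosen
    execution-order policy: 'fifo' or 'sjf' (shortest job first).
    """
    requests = sorted(requests)               # order by arrival time
    ready, clock, i, total = [], 0.0, 0, 0.0
    n = len(requests)
    while i < n or ready:
        # Admit every request that has arrived by the current clock.
        while i < n and requests[i][0] <= clock:
            arrival, exec_time = requests[i]
            key = exec_time if policy == "sjf" else arrival
            heapq.heappush(ready, (key, arrival, exec_time))
            i += 1
        if not ready:                          # instance idle: jump to next arrival
            clock = requests[i][0]
            continue
        _, arrival, exec_time = heapq.heappop(ready)
        clock += exec_time
        total += clock - arrival
    return total / n

# A short request and a long request queue up behind the first one;
# SJF serves the short one first and lowers the mean sojourn time.
trace = [(0.0, 1.0), (0.1, 5.0), (0.2, 0.5)]
print(simulate_instance(trace, "fifo"), simulate_instance(trace, "sjf"))
```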

    Characterization of a big data storage workload in the cloud

    The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of the proliferation of big-data-specific formats.
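    As a rough illustration of the kind of characterization described above, the sketch below buckets read events per hour of the week (to expose daily and weekly imbalance) and computes the share of reads going to the most popular files (a crude heavy-tail indicator). The trace format (Unix timestamps, per-file read counts) is an assumption for illustration, not the schema of the Databricks trace.

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_read_counts(read_timestamps):
    """Bucket read events (Unix seconds) per (weekday, hour) to expose daily and weekly imbalance."""
    counts = Counter()
    for ts in read_timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[(dt.weekday(), dt.hour)] += 1
    return counts

def top_percentile_share(read_counts_per_file, pct=1):
    """Share of all reads absorbed by the top pct% most-read files; values near 1 suggest heavy-tailed popularity."""
    counts = sorted(read_counts_per_file, reverse=True)
    k = max(1, len(counts) * pct // 100)
    return sum(counts[:k]) / max(1, sum(counts))
```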

    The Workflow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures

    Realistic, relevant, and reproducible experiments often need input traces collected from real-world environments. In this work, we focus on traces of workflows, common in datacenters, clouds, and HPC infrastructures. We show that the state-of-the-art in using workflow traces raises important issues: (1) the use of realistic traces is infrequent and (2) the use of realistic, open-access traces even more so. Alleviating these issues, we introduce the Workflow Trace Archive (WTA), an open-access archive of workflow traces from diverse computing infrastructures and tooling to parse, validate, and analyze traces. The WTA includes >48 million workflows captured from >10 computing infrastructures, representing a broad diversity of trace domains and characteristics. To emphasize the importance of trace diversity, we characterize the WTA contents and analyze in simulation the impact of trace diversity on experiment results. Our results indicate significant differences in characteristics, properties, and workflow structures between workload sources, domains, and fields.
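    To illustrate the parse-and-validate tooling mentioned above, here is a minimal sketch that loads a workflow from a hypothetical JSON layout (tasks with id, runtime, and parents), checks that the dependency graph is a DAG, and computes the critical-path length. The file format is an assumption chosen for the example, not the WTA's actual schema.

```python
import json

def load_workflow(path):
    """Load a workflow trace from a hypothetical JSON layout {'tasks': [{'id', 'runtime', 'parents'}]} and validate it."""
    with open(path) as f:
        tasks = {t["id"]: t for t in json.load(f)["tasks"]}
    # Validate that every parent exists and the dependency graph is acyclic (Kahn's algorithm).
    indeg = {tid: 0 for tid in tasks}
    children = {tid: [] for tid in tasks}
    for t in tasks.values():
        for p in t.get("parents", []):
            if p not in tasks:
                raise ValueError(f"task {t['id']} references unknown parent {p}")
            indeg[t["id"]] += 1
            children[p].append(t["id"])
    order = []
    frontier = [tid for tid, d in indeg.items() if d == 0]
    while frontier:
        tid = frontier.pop()
        order.append(tid)
        for c in children[tid]:
            indeg[c] -= 1
            if indeg[c] == 0:
                frontier.append(c)
    if len(order) != len(tasks):
        raise ValueError("workflow contains a dependency cycle")
    # Critical-path length: longest runtime chain, a common per-workflow characteristic.
    finish = {}
    for tid in order:
        t = tasks[tid]
        finish[tid] = t["runtime"] + max((finish[p] for p in t.get("parents", [])), default=0.0)
    return tasks, max(finish.values(), default=0.0)
```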

    The Workflow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures -- Technical Report

    Realistic, relevant, and reproducible experiments often need input traces collected from real-world environments. We focus in this work on traces of workflows, common in datacenters, clouds, and HPC infrastructures. We show that the state-of-the-art in using workflow traces raises important issues: (1) the use of realistic traces is infrequent, and (2) the use of realistic, open-access traces even more so. Alleviating these issues, we introduce the Workflow Trace Archive (WTA), an open-access archive of workflow traces from diverse computing infrastructures and tooling to parse, validate, and analyze traces. The WTA includes >48 million workflows captured from >10 computing infrastructures, representing a broad diversity of trace domains and characteristics. To emphasize the importance of trace diversity, we characterize the WTA contents and analyze in simulation the impact of trace diversity on experiment results. Our results indicate significant differences in characteristics, properties, and workflow structures between workload sources, domains, and fields. Comment: Technical report.

    Workload Characterization and Modeling, and the Design and Evaluation of Cache Policies for Big Data Storage Workloads in the Cloud

    The proliferation of big-data processing platforms has already led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems enables tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. In this work, we focus on understanding the behavior and cache performance of the storage sub-system used for Spark workloads in the cloud. First, we statistically characterize its usage. Second, we design a generative model to tackle the scarcity of workload traces. Third, we design a cache policy that puts our insights from the characterization to work. Finally, we evaluate the performance of different cache policies for big data workloads via simulation.
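    The trace-driven cache evaluation mentioned above can be sketched in a few lines of Python: replay a read trace through a size-aware LRU cache and report the byte hit rate. LRU and the (object id, size) trace format are assumptions for illustration; this is not the cache policy designed in this work.

```python
from collections import OrderedDict

def lru_byte_hit_rate(trace, capacity_bytes):
    """Replay a read trace of (object_id, size_bytes) pairs through a size-aware LRU cache."""
    cache, used = OrderedDict(), 0        # object_id -> size, ordered by recency
    hit_bytes = total_bytes = 0
    for obj, size in trace:
        total_bytes += size
        if obj in cache:
            hit_bytes += size
            cache.move_to_end(obj)        # mark as most recently used
            continue
        if size > capacity_bytes:         # object larger than the whole cache: bypass
            continue
        # Miss: evict least-recently-used objects until the new object fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[obj] = size
        used += size
    return hit_bytes / total_bytes if total_bytes else 0.0
```

    Swapping the eviction rule (the popitem line) for another policy is all that is needed to compare alternatives on the same trace.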

    Efficient Estimation of Read Density when Caching for Big Data Processing

    Big data processing systems are becoming increasingly present in cloud workloads. Consequently, they are starting to incorporate more sophisticated mechanisms from traditional database and distributed systems. We focus in this work on the use of caching policies, which for big data raise important new challenges. Not only must they respond to new variants of the trade-off between hit rate, response time, and the space consumed by the cache, but they must do so at possibly higher volume and velocity than web and database workloads. Previous caching policies have not been tested experimentally with big data workloads. We address these challenges in this work. We propose the Read Density family of policies, a principled approach that quantifies the utility of cached objects through a family of utility functions depending on the frequency of reads of an object. We further design the Approximate Histogram, a technique based on an array of counters that promises runtime- and space-efficient computation of the metric required by the cache policy. We evaluate the caching policies from the Read Density family through trace-based simulation and compare them with over ten state-of-the-art alternatives, using two workload traces representative of big data processing, collected from commercial Spark and MapReduce deployments. While we achieve performance comparable to the state-of-the-art with fewer parameters, meaningful performance improvements for big data workloads remain elusive.
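    The idea of estimating read frequency with an array of counters can be sketched as follows. This is an illustrative approximation in the spirit of the Approximate Histogram, not its exact construction, and the read-density utility shown is one plausible frequency-per-byte definition rather than the paper's own.

```python
import hashlib

class ApproximateReadCounter:
    """Fixed-size array of counters indexed by hashing object ids; constant space, approximate counts."""

    def __init__(self, num_counters=1 << 16):
        self.counters = [0] * num_counters

    def _slot(self, obj_id):
        digest = hashlib.blake2b(obj_id.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % len(self.counters)

    def record_read(self, obj_id):
        self.counters[self._slot(obj_id)] += 1

    def estimate(self, obj_id):
        # Hash collisions can only inflate the estimate, never lose reads.
        return self.counters[self._slot(obj_id)]

def read_density(counter, obj_id, size_bytes):
    """One possible utility: estimated reads per cached byte; evict the object with the lowest value."""
    return counter.estimate(obj_id) / max(1, size_bytes)
```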

    How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster

    Reliable job execution is important in High Performance Computing (HPC) clusters. Understanding the failure distribution and failure patterns of jobs helps HPC cluster managers design better systems, and helps users design fault-tolerant applications. Machine learning is an increasingly popular workload that HPC clusters are used for, but there is little information on machine learning job failure characteristics on HPC clusters, and on how they differ from the previous workloads such clusters served. The goal of our work is to improve the understanding of machine learning job failures in HPC clusters. We collect and analyze job data spanning the whole of 2022, covering over 2 million jobs. We analyze basic statistical characteristics, the time pattern of failures, the resource waste caused by failures, and their autocorrelation. Among our findings: machine learning jobs fail at a higher rate than non-ML jobs, and waste much more CPU-time per job when they fail.
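    The basic statistical characteristics described above boil down to simple aggregations over the job log. The sketch below computes the failure rate and the CPU-hours wasted per failed job for ML versus non-ML jobs; the record fields (state, runtime_s, num_cpus, is_ml) are a hypothetical schema, not the actual dataset's.

```python
def failure_stats(jobs):
    """jobs: iterable of dicts with 'state', 'runtime_s', 'num_cpus', 'is_ml' (hypothetical schema).

    Returns, per group, the failure rate and the mean CPU-hours wasted per failed job.
    """
    stats = {}
    for label, is_ml in (("ml", True), ("non_ml", False)):
        group = [j for j in jobs if j["is_ml"] == is_ml]
        failed = [j for j in group if j["state"] == "FAILED"]
        rate = len(failed) / len(group) if group else 0.0
        wasted_cpu_hours = (
            sum(j["runtime_s"] * j["num_cpus"] for j in failed) / 3600 / len(failed)
            if failed else 0.0
        )
        stats[label] = {"failure_rate": rate, "wasted_cpu_hours_per_failed_job": wasted_cpu_hours}
    return stats
```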

    A Reference Architecture for Datacenter Scheduler Programming Abstractions: Design and Experiments (Work In Progress Paper)

    Datacenters are the backbone of our digital society, used by industry, academic researchers, public institutions, and others. To manage resources, datacenters make use of sophisticated schedulers. Each scheduler offers a different set of capabilities, and users make use of them through the APIs they offer. However, there is no clear understanding of what programming abstractions schedulers offer, nor why they offer some and not others. Consequently, it is difficult to understand the differences between them and the performance costs imposed by their APIs. In this work, we study the programming abstractions offered by industrial schedulers, their shortcomings, and the performance costs of these shortcomings. We propose a general reference architecture for scheduler programming abstractions. Specifically, we analyze the programming abstractions of five popular industrial schedulers, analyze the differences in their APIs, identify the missing abstractions, and finally carry out an exemplary experiment to demonstrate that schedulers sacrifice performance by under-implementing programming abstractions. In the experiments, we demonstrate that an API extension can improve task runtime by up to 23%. This work allows scheduler designers to identify shortcomings and points of improvement in their APIs and, most importantly, provides a reference architecture for existing and future schedulers.
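    To give a feel for what "programming abstractions" means here, the sketch below defines a minimal scheduler-facing interface with single-task submission, an all-or-nothing gang submission, and cancellation. It is an illustrative strawman, not the reference architecture proposed in this work and not any particular scheduler's API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    cpus: float
    memory_mb: int
    constraints: dict = field(default_factory=dict)   # e.g. node affinity or anti-affinity hints

class SchedulerAPI(ABC):
    """A minimal, illustrative set of scheduler programming abstractions."""

    @abstractmethod
    def submit(self, task: TaskSpec) -> str:
        """Submit a single task; returns a task id."""

    @abstractmethod
    def submit_gang(self, tasks: list[TaskSpec]) -> list[str]:
        """Place a group of tasks all-or-nothing; the kind of abstraction that is often missing or under-implemented."""

    @abstractmethod
    def cancel(self, task_id: str) -> None:
        """Cancel a queued or running task."""
```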

    A Trace-driven Performance Evaluation of Hash-based Task Placement Algorithms for Cache-enabled Serverless Computing

    Data-driven interactive computation is widely used for business analytics, search-based decision-making, and log mining. These applications' short duration and bursty nature make them a natural fit for serverless computing. Data-processing serverless applications are composed of many small tasks. Application tasks that use remote storage encounter bottlenecks in the form of high latency, performance variability, and throttling. Caching has been used to mitigate this bottleneck for intermediate data. However, the use of caching for input data, albeit widely used in industry, has yet to be studied. We present the first performance study of scaling, a key feature of serverless computing, on serverless clusters with input data caches. We compare 8 task placement algorithms and quantify their impact on task slowdown and resource usage before and after scaling. We quantify the consequences of using work stealing. We quantify the performance impact of scaling in the buffer period immediately after scaling. We find up to a 420% increase in task slowdown after scaling without work stealing, and a 22% slowdown with work stealing. We also find that cache misses after scaling can lead to an additional 21% resource usage.
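    Hash-based task placement is typically built on consistent hashing: a task is routed to a worker by hashing the input object it reads, so repeated reads of the same input hit the same worker's local cache, and scaling out remaps only a small share of keys. The sketch below shows this idea; the virtual-node count and the key format are assumptions, and none of the 8 algorithms evaluated in this work is claimed to be exactly this one.

```python
import bisect
import hashlib

class ConsistentHashPlacer:
    """Place tasks on workers by hashing their input object onto a ring of virtual nodes."""

    def __init__(self, workers, vnodes=64):
        self.ring = []                                 # sorted (hash, worker) pairs
        for w in workers:
            for v in range(vnodes):
                self.ring.append((self._hash(f"{w}#{v}"), w))
        self.ring.sort()
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

    def place(self, input_object_id):
        # The first virtual node clockwise from the key's hash owns the task.
        i = bisect.bisect(self._keys, self._hash(input_object_id)) % len(self.ring)
        return self.ring[i][1]

# Tasks keyed by the same input partition keep landing on the same worker,
# which keeps that worker's input cache warm across invocations.
placer = ConsistentHashPlacer(["worker-1", "worker-2", "worker-3"])
print(placer.place("part-00042"))
```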